U.S. Census data is one of the predominant sources of information on population size, particularly when disaggregated by age, sex, race/ethnicity, or other demographic and economic variables. Moreover, it is one of the only datasets providing nationally representative demographic and economic information on households and housing units. Data is collected every 10 years via the Decennial Census of Population and Housing, while the American Community Survey (ACS) is an annual, nationally representative survey of households that collects more in-depth information about households and their members than the decennial Census.
It would be impossible to summarize the full array of questions asked in the ACS, but one particularly relevant question is related to household member migration:
Question 15a. Did this person live in this house or apartment 1 year ago? Possible answers:
Question 15b. Where did this person live 1 year ago? Answer fields are: Address, Name of city, town, or post office, Name of U.S. county or municipio in Puerto Rico, Name of U.S. state or Puerto Rico, and ZIP Code.
Other Census datasets include the Population Estimates Program (annual post-censal population and housing units estimates for each state, county, and jurisdiction), the Current Population Survey (ongoing surveys on household membership and labor force participation), and many datasets on economic indicators.
Census data is available at a wide array of spatial resolutions, and availability varies by data source. Figure 1 shows the hierarchy of Census geographies. Geographies that can be built up from the basic Census geographic unit, the Census block, lie along the vertical “geographic spine”. Certain data products are also produced for some of the “off-spine” geographies in Figure 1.
Decennial Census data is available at all geographies on the spine, but the introduction of differential privacy in the 2020 Decennial Census results in published counts that have more uncertainty at the smallest geographies of the spine. Counts for the nation, states, and large counties in 2020 are reported with no uncertainty.
Aggregated, area-level ACS data is available annually at the national and state levels and for counties and cities with populations of 65,000 or more. Uncertainty in these estimates arises from the sampling variability inherent in the household sampling scheme. Though data is collected annually, data at the Census tract and block group level is only shared publicly using 5 years of pooled data. Sub-samples of ACS microdata, or person-level data, can also be accessed for larger geographies.
Many Census data tables are available for download as Excel or .csv files via the Census website and through this interactive Census website. However, accessing data this way is not always reproducible. Moreover, you often need to download data for larger geographic areas than interest you. Many packages and services exist to incorporate the downloading of specific variables for specific geographies into a reproducible workflow within R or Python. In addition, these packages often do much of the necessary “heavy lifting” required to get the data into an easy-to-use format for other data science, GIS, or statistical programming tasks.
This tutorial will provide you with the necessary tools to access the Census data you need in a reproducible and simple way.
To be drafted.
censusapi

The Census Application Programming Interface (API) is a tool built by the Census Bureau to allow for the downloading of public-use Census data in an easy and systematic way. The Census API can be queried by anyone with an internet connection and a Census API key (a unique-to-you alphanumeric code issued by the Census Bureau). It is constructed so you can specify the dataset and year you want, the specific variables you want within that dataset, and the geography for which you want the information within a formulaic URL in a relatively intuitive way. (See examples.) Though you can type these URLs into your internet browser, using functions within R or Python to visit the URLs that query the API allows for a reproducible workflow. This online Census Bureau course is useful for those interested in learning more about the Census API.
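To make the formulaic URL concrete, the sketch below assembles one by hand in R. The build_census_url() helper is our own illustrative function, not part of any package; the query pattern shown (get, for, and key parameters against an acs/acs1 endpoint) follows the Census API's documented structure.

```r
# A minimal sketch of Census API URL construction
# (illustrative helper, not part of any package)
build_census_url <- function(year, dataset, vars, geo, key) {
  paste0(
    "https://api.census.gov/data/", year, "/", dataset,
    "?get=", paste(vars, collapse = ","),
    "&for=", geo,
    "&key=", key
  )
}

# e.g. 2021 ACS 1-year: name and total population for Washington (FIPS 53)
build_census_url(2021, "acs/acs1", c("NAME", "B01001_001E"), "state:53", "YOUR_KEY")
```

Visiting the resulting URL in a browser (with a valid key) returns the requested values as JSON; packages like censusapi and tidycensus construct and submit such URLs for you.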
The censusapi package (CRAN,
GitHub, Author website) is an R
package written and maintained by Hannah Recht that provides a suite of
functions that facilitate the construction and implementation of Census
API URL queries for more than 1,500 Census API endpoints. This package
is an incredible tool, but there are better options if you are mainly
interested in decennial Census and ACS data.
ipumsr… [Tyler, do you want to write this section? I’m happy to if not! Also, do we move tigris here?]
tidycensus

The tidycensus package (CRAN, GitHub, Author website) is an R
package written and maintained by Kyle Walker. The package provides
access to tabular decennial Census, ACS, ACS Migration Flows, and
Population Estimates datasets at both on- and off-spine Census
geographies. In addition to taking care of the logistics of querying the
API for you with easy-to-use functions, the tidycensus
package wrangles the data into a nice format, providing spatial data
objects when requested.
The remainder of this tutorial will focus on using
tidycensus.
To use either the censusapi or tidycensus
packages you need to request your Census
API key. This process is quick! Your own personal API key will be
delivered to the email address provided.
One challenge with using these packages in a reproducible and group-based workflow is the management of API keys. Each member of the team who will be using any code that queries the Census APIs needs their own API key. API keys are not meant to be shared.
One nice workaround is to have everyone in a group create their own
R script such as the file CensusAPIKey.R in Figure 2. This script
assigns an individual’s Census API key to a character object with a
common name. In this example, everyone’s API character string is
assigned to an object called myKey.
When initializing your group’s GitHub repository, add the name of
this file to your .gitignore file as in Figure 3.
NB: It’s best to do this before adding the CensusAPIKey.R to
your repository.
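For example, assuming the file lives at the repository root and is named exactly as in Figure 2, the relevant .gitignore line is simply:

```
# Keep individual Census API keys out of version control
CensusAPIKey.R
```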
Once your .gitignore is modified and pushed, every team
member can have their own CensusAPIKey.R file on their own machine that
is never pushed to GitHub. Your team’s code can be written in such a way
that if CensusAPIKey.R is sourced in a script, every time an API key is
needed a reference to the object myKey will be sufficient
regardless of which team member is executing code. See example using
tidycensus below.
## Example works when CensusAPIKey.R is stored in
## the same filepath as the .R script being run
## pointed to by rstudioapi::getActiveDocumentContext()$path
# Setup ####
## setwd() ####
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
## Libraries ####
library(tidycensus)
### Source Census API Key ####
source("CensusAPIKey.R")
### Set session API key in tidycensus ####
census_api_key(key = myKey)
tidycensus

The tidycensus package is superbly developed and
maintained. Moreover, Kyle Walker has developed a wealth of material
to aid tidycensus users, including an entire textbook, Analyzing US Census
Data: Methods, Maps, and Models in R. For questions not answered in
this tutorial, this book is a great place to start.
To access Census data via tidycensus you need to know
the following information: the dataset (and thus the function) you will use, the variables you want, the year(s) of interest, and the geography at which you want data.
In the subsections below, we briefly discuss the selection process
and how it relates to the tidycensus functions for querying
Census APIs. Following that, we provide full start-to-finish
examples of querying both the decennial Census and ACS.
The dataset you use will determine the name of the
tidycensus function used to query the Census API.
One accesses decennial Census data via the get_decennial()
function and ACS data via the get_acs() function.
In tidycensus variables can be identified in the output
of the load_variables() function by their
name, label, and concept. A
concept such as SEX BY AGE below is shared by
many variables each of which are uniquely identified by an alphanumeric
string, or their name. Variables are also uniquely
identified by the combination of their label and their
concept.
In the example below, all variables in the 2021 ACS within the
SEX BY AGE concept have a name
that begins with B01001. These variables contain estimates
and margins of error for population counts by sex (Male, Female) and age
categories (under 5, 5-9, 10-14, 15-17, 18-19, 20, 21, 22-24, 25-29, 30-34,
35-39, 40-44, 45-49, 50-54, 55-59, 60-61, 62-64, 65-66, 67-69, 70-74,
75-79, 80-84, and 85+). Totals are provided for the population overall
and for each sex (e.g. B01001_001 and
B01001_002 below).
load_variables(year = 2021, dataset = "acs1") %>%
filter(concept == "SEX BY AGE") %>% head()
## # A tibble: 6 × 3
## name label concept
## <chr> <chr> <chr>
## 1 B01001_001 Estimate!!Total: SEX BY AGE
## 2 B01001_002 Estimate!!Total:!!Male: SEX BY AGE
## 3 B01001_003 Estimate!!Total:!!Male:!!Under 5 years SEX BY AGE
## 4 B01001_004 Estimate!!Total:!!Male:!!5 to 9 years SEX BY AGE
## 5 B01001_005 Estimate!!Total:!!Male:!!10 to 14 years SEX BY AGE
## 6 B01001_006 Estimate!!Total:!!Male:!!15 to 17 years SEX BY AGE
A variable’s name is passed as a character string to the
variables argument of the get_acs() or
get_decennial() functions,
e.g. variables = "B01001_001". A character vector can be
passed if you would like to select more than one variable in the same
API query,
e.g. variables = c("B01001_001", "B01001_002").
There is also a table argument that allows you to select
all variables from the same concept. For example, to select
all variables from the SEX BY AGE concept, one would
specify table = "B01001" and not specify anything for the
variables argument.
[to draft]
Both get_acs() and get_decennial() have a
year argument. For the decennial Census and ACS 1-year
data, the argument is simply the year for which you want data. However,
for ACS 5-year data, the year argument still takes only one
value and that value is the endpoint of the 5-year interval.
For example, specifying year = 2019 for ACS 5-year data
will return estimates produced using ACS data collected in years
2015-2019.
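As a quick mental check, the one-line helper below (purely illustrative, not part of tidycensus) maps a 5-year endpoint year to the collection years it summarizes:

```r
# Collection years summarized by an ACS 5-year endpoint year (illustrative)
acs5_years <- function(endpoint) (endpoint - 4):endpoint

acs5_years(2019)  # 2015 2016 2017 2018 2019
```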
Both get_decennial() and get_acs() require
one to specify the geography argument indicating the
geographic level (e.g. national, county,
tract, etc.) at which data is requested. The tidycensus
Basic Usage vignette contains a list of the Census geographies and
how they should be specified in tidycensus.
Once you know your dataset of interest AND your geographic scale of interest, you are ready to specify your temporal scale.
If you are using the decennial Census, you will be
requesting data at the single year scale regardless of your
geography. In the get_decennial() function you
must set your sumfile argument to the appropriate Census
summary file for that year (2000, 2010, or 2020). Find appropriate
sumfile arguments for each decennial Census in Section
2.1.1.1 of Kyle Walker’s book.
If you are using the ACS, you will be requesting data at
either the 1-year or 5-year scale depending upon your choice of
geography. In the get_acs() function you
must set your survey argument to either acs1
for 1-year data or acs5 for 5-year data. For the nation, states,
and large counties or cities (Census places), you can usually
specify survey = "acs1". For small counties or places,
tracts, or block groups you will almost certainly need to specify
survey = "acs5". However, even at the state or county
level, there may be some variables, e.g. population counts by age, sex,
and race for a race category in the Census, that have a small count in
that area and will require the use of 5-year data.
get_decennial()

Below we provide code for pulling counts of total housing units for each county in the state of Washington from the 2020 decennial Census using the following information:

sumfile = "pl"
variables = "H1_001N"
year = 2020
geography = "county" and state = "WA"
key = myKey

hu_wa_2020 <- get_decennial(geography = "county",
variables = "H1_001N",
year = 2020,
sumfile = "pl",
state = "WA",
key = myKey)
head(hu_wa_2020)
## # A tibble: 6 × 4
## GEOID NAME variable value
## <chr> <chr> <chr> <dbl>
## 1 53001 Adams County, Washington H1_001N 6735
## 2 53003 Asotin County, Washington H1_001N 10034
## 3 53005 Benton County, Washington H1_001N 80076
## 4 53007 Chelan County, Washington H1_001N 37267
## 5 53009 Clallam County, Washington H1_001N 37930
## 6 53011 Clark County, Washington H1_001N 195036
GEOIDs

The GEOID column of the output includes the standard Census
numeric code for a particular geography. States and counties are
identified by their Federal
Information Processing Series (FIPS) codes.
The state code is 2 digits long, Washington’s FIPS
code is 53, and we see each GEOID above begins with those
digits.
The county code within each state is 3 digits long.
Adams County’s unique code within Washington is 001. So, the full
GEOID for Adams County, Washington is 53001.
The tract code that uniquely identifies a Census
tract within a county is 6 digits long. The resulting GEOID
for a tract in Adams County, Washington is then 53001TTTTTT, where
TTTTTT represents a 6-digit tract code.
The block group code that uniquely identifies each
Census block group within a certain Census tract, TTTTTT, within Adams
County, Washington is 1 digit long. The resulting GEOID for
a block group within tract TTTTTT in Adams County, Washington is
53001TTTTTTG, where G represents a 1-digit block group code.
The block code that uniquely identifies each Census
block within a block group, G, within a certain Census tract, TTTTTT,
within Adams County, Washington is 3 digits long. The resulting
GEOID for a block within tract TTTTTT and block group
G in Adams County, Washington is 53001TTTTTTGBBB, where BBB represents a
3-digit block code.
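Putting the pieces together, the sketch below uses base R substr() to split a full 15-character block GEOID into its nested components. The helper and the example GEOID are our own illustrations.

```r
# Split a 15-character Census block GEOID into its nested components:
# state (2) + county (3) + tract (6) + block group (1) + block (3)
parse_block_geoid <- function(geoid) {
  list(
    state       = substr(geoid, 1, 2),
    county      = substr(geoid, 3, 5),
    tract       = substr(geoid, 6, 11),
    block_group = substr(geoid, 12, 12),
    block       = substr(geoid, 13, 15)
  )
}

# A hypothetical block GEOID in Adams County (53001), Washington
parse_block_geoid("530019501001001")
```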
get_acs()

Below we provide code for pulling counts of individuals by sex for King County, Washington from the 2021 ACS using the following information:

variables = c("B01001_002", "B01001_026")
year = 2021
geography = "county" and state = "WA" and county = "King"
survey = "acs1"
key = myKey

pop_sex_king_2021 <- get_acs(geography = "county",
variables = c("B01001_002", "B01001_026"),
year = 2021,
survey = "acs1",
state = "WA",
county = "King",
key = myKey)
head(pop_sex_king_2021)
## # A tibble: 2 × 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 53033 King County, Washington B01001_002 1140802 552
## 2 53033 King County, Washington B01001_026 1111503 552
While the output of get_decennial() only has a fourth
column called value after GEOID,
NAME, and variable, ACS output has two
columns: estimate and moe.
NB: The moe represents the whole margin of error
for a 90% (by default) confidence interval (CI). In other words,
estimate + moe will give you the upper limit of a 90% CI and estimate -
moe will give you the lower limit.
To calculate the standard error for each estimate when the default 90% margin of error is used, use the following code:
pop_sex_king_2021 <- pop_sex_king_2021 %>%
mutate(se = moe/qnorm(.95))
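If you instead want a margin of error at a different confidence level, say 95%, a standard adjustment is to rescale the published 90% moe by the ratio of normal critical values. The helper below is our own sketch of that calculation:

```r
# Rescale a published 90% ACS margin of error to a 95% margin of error
# using the ratio of normal critical values (about 1.96 / 1.645)
moe_90_to_95 <- function(moe90) moe90 * qnorm(0.975) / qnorm(0.95)

moe_90_to_95(552)  # about 658
```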
Below we provide code for pulling counts of individuals by sex and age for King County, Washington from the 2021 ACS using the following information:
table = "B01001"
year = 2021
geography = "county" and state = "WA" and county = "King"
survey = "acs1"
key = myKey

NB: If we wanted info for a county with a population less
than 65,000 we would need to get 5-year data for 2017-2021 using
survey = "acs5".
pop_sex_age_king_2021 <- get_acs(geography = "county",
table = "B01001",
year = 2021,
survey = "acs1",
state = "WA",
county = "King",
key = myKey)
head(pop_sex_age_king_2021)
## # A tibble: 6 × 5
## GEOID NAME variable estimate moe
## <chr> <chr> <chr> <dbl> <dbl>
## 1 53033 King County, Washington B01001_001 2252305 NA
## 2 53033 King County, Washington B01001_002 1140802 552
## 3 53033 King County, Washington B01001_003 61208 521
## 4 53033 King County, Washington B01001_004 62450 3251
## 5 53033 King County, Washington B01001_005 66507 3251
## 6 53033 King County, Washington B01001_006 37342 367
Now that we’ve walked through querying the Census API for the
decennial Census and the ACS via tidycensus we need to
cover some basics to make this data more usable. First and foremost, no
one will ever remember every alphanumeric Census
variable name by heart. But, if we use the output of
load_variables() coupled with some data-wrangling, we can
get neat string labels in no time!
Step 1: Save the output of
load_variables()
acs_vars_2021 <- load_variables(year = 2021, dataset = "acs1")
head(acs_vars_2021)
## # A tibble: 6 × 3
## name label concept
## <chr> <chr> <chr>
## 1 B01001A_001 Estimate!!Total: SEX BY AGE (WHITE ALONE)
## 2 B01001A_002 Estimate!!Total:!!Male: SEX BY AGE (WHITE ALONE)
## 3 B01001A_003 Estimate!!Total:!!Male:!!Under 5 years SEX BY AGE (WHITE ALONE)
## 4 B01001A_004 Estimate!!Total:!!Male:!!5 to 9 years SEX BY AGE (WHITE ALONE)
## 5 B01001A_005 Estimate!!Total:!!Male:!!10 to 14 years SEX BY AGE (WHITE ALONE)
## 6 B01001A_006 Estimate!!Total:!!Male:!!15 to 17 years SEX BY AGE (WHITE ALONE)
Step 2: Join load_variables() output with
get_*() output Below, we join
acs_vars_2021 onto pop_sex_age_king_2021 by
the respective variable name column in each dataset. This still leaves
the pesky issue of the “!!” and “:” in the label strings.
pop_sex_age_king_2021 <- pop_sex_age_king_2021 %>%
left_join(acs_vars_2021 %>%
select(name, label),
by = c("variable" = "name")) %>%
relocate("label", .after = "variable")
head(pop_sex_age_king_2021)
## # A tibble: 6 × 6
## GEOID NAME variable label estimate moe
## <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 53033 King County, Washington B01001_001 Estimate!!Total: 2252305 NA
## 2 53033 King County, Washington B01001_002 Estimate!!Total:!!Mal… 1140802 552
## 3 53033 King County, Washington B01001_003 Estimate!!Total:!!Mal… 61208 521
## 4 53033 King County, Washington B01001_004 Estimate!!Total:!!Mal… 62450 3251
## 5 53033 King County, Washington B01001_005 Estimate!!Total:!!Mal… 66507 3251
## 6 53033 King County, Washington B01001_006 Estimate!!Total:!!Mal… 37342 367
Step 3: Clean up the variable labels How you want to
do this will vary by concept; this specific example is appropriate
for the “SEX BY AGE” concept. The result of this code is extra
sex and age columns added to the ACS data
making it easier to plot and analyze.
pop_sex_age_king_2021 <- pop_sex_age_king_2021 %>%
## Replace the starting "Estimate!!Total:"
## with an empty string everywhere
mutate(label = gsub("Estimate!!Total:", "", label),
## Replace all remaining !!
label = gsub("!!", "", label)) %>%
## The only special characters left now are the :
## between sex and age groups
## we will split the string in each row on the :
## and retain the first element for sex and the second for age
group_by(variable) %>%
mutate(sex = unlist(strsplit(label, ":"))[1],
age = unlist(strsplit(label, ":"))[2]) %>%
ungroup() %>%
## Fill in "Total" strings where sex or age are missing
mutate(sex = ifelse(is.na(sex), "Total", sex),
age = ifelse(is.na(age), "Total", age)) %>%
relocate(c("sex", "age"), .after = "label")
## Check our work
pop_sex_age_king_2021 %>%
select(variable, label, sex, age) %>%
head()
## # A tibble: 6 × 4
## variable label sex age
## <chr> <chr> <chr> <chr>
## 1 B01001_001 "" Total Total
## 2 B01001_002 "Male:" Male Total
## 3 B01001_003 "Male:Under 5 years" Male Under 5 years
## 4 B01001_004 "Male:5 to 9 years" Male 5 to 9 years
## 5 B01001_005 "Male:10 to 14 years" Male 10 to 14 years
## 6 B01001_006 "Male:15 to 17 years" Male 15 to 17 years
[to draft a kableExtra example]
[to draft a population line chart estimate across time]
[to draft]
Census and ACS data are often associated with inherent geographies, which are units at which the data are aggregated. The defined geographies are represented in the U.S. Census Bureau’s TIGER/Line database. TIGER stands for Topologically Integrated Geographic Encoding and Referencing and can be simply thought of as the official geographic spatial data of the U.S. Census Bureau.
Spatial data sets are made available as shapefiles and include three general types of data:
Legal entities (e.g. states and counties)
Statistical entities (e.g. Census tracts and block groups)
Geographic features (e.g. roads and water features)
A hierarchy of U.S. Census Bureau geographic data can be found here.
To work with Census geographic data, we suggest the use of the tigris package, which simplifies the process for R users of obtaining and using Census geographic data sets.
Each type of geographic dataset available in the Census Bureau’s TIGER/Line database is available with a function in tigris. For example, the states() function can be run without arguments to download a boundary file of US states and state equivalents.
library(tigris)
st <- states(progress_bar = FALSE)
tigris defaults to the most recent year for which a complete set of Census shapefiles is available, which, at the time of writing, is 2021.
Investigating the st object, we see an sf object representing all US states and territories that includes a data frame with a series of columns describing characteristics of those states, such as name, postal code, and Census ID (the GEOID column).
The geometry of the st object can be seen using the plot() function.
plot(st$geometry)
Other Census datasets may be available by state or by county within a state. For example, the counties() function can be used to obtain county boundaries for the entirety of the United States, but it can also be used with the state argument to return only those counties from a specific state, like Washington.
wa_counties <- counties("WA", progress_bar = FALSE)
plot(wa_counties$geometry)
States and counties are examples of legal entities that can be accessed with tigris. Statistical entities and geographic features are similarly accessed if they exist in the TIGER/Line database. For example, a user might request Census tract boundaries for a given county in Washington with the corresponding tracts() function.
seattle_tracts <- tracts("WA", "King", progress_bar = FALSE)
plot(seattle_tracts$geometry)
R includes a number of options outside of the plot() function for quick visualization of geographic data that will be useful for geospatial analysis. For ease and intuitiveness, we suggest the ggplot2 or tmap packages.
At a basic level, a couple lines of ggplot2 code are all that are needed to plot Census shapes obtained with tigris. For example, taking a look at the King County Census tracts:
library(ggplot2)
ggplot(seattle_tracts) +
geom_sf()
By default, ggplot2 displays the map on a standard grey background with latitude and longitude values displayed along the axes. For many cartographic applications, we would prefer to remove this background information. The theme_minimal() function strips the grey panel background while keeping the axis labels:
ggplot(seattle_tracts) +
geom_sf() +
theme_minimal()
In addition to the traditional TIGER/Line shapefiles, the Census Bureau also makes available cartographic boundary shapefiles. These files are derived from the TIGER/Line shapefiles but are generalized in the interior and clipped to the shoreline of the United States, making them a better choice than the TIGER/Line shapefiles for thematic mapping in many cases. This is especially important when dealing with states that have some connection to water/international boundaries.
For our Washington counties data set, we would want to run the following code accounting for the cb = TRUE argument:
wa_counties_cb <- counties("WA", cb = TRUE, progress_bar = FALSE)
plot(wa_counties_cb$geometry)
Look at the difference below:
library(patchwork)
wa_tiger_gg <- ggplot(wa_counties) +
geom_sf() +
theme_minimal() +
labs(title = "TIGER/Line")
wa_cb_gg <- ggplot(wa_counties_cb) +
geom_sf() +
theme_minimal() +
labs(title = "Cartographic boundary")
wa_tiger_gg + wa_cb_gg
For geographic data to appropriately represent locations in mapping and geospatial analysis, they must be correctly referenced to some model of the Earth’s surface. We define this representative model of Earth with a coordinate reference system (CRS), which specifies not only how data coordinates should be mapped to a model of the Earth’s surface but also how measurements should be computed using a given data set.
By default, data sets returned by tigris are stored in a geographic coordinate system, in which coordinates are represented as longitude and latitude relative to a three-dimensional model of the earth. The st_crs() function in the sf package helps us check the CRS of our data.
For example, let’s take another look at our Washington state data.
library(sf)
st_crs(wa_counties)
## Coordinate Reference System:
## User input: NAD83
## wkt:
## GEOGCRS["NAD83",
## DATUM["North American Datum 1983",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## CS[ellipsoidal,2],
## AXIS["latitude",north,
## ORDER[1],
## ANGLEUNIT["degree",0.0174532925199433]],
## AXIS["longitude",east,
## ORDER[2],
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",4269]]
All Census Bureau data sets are stored in the “NAD83” geographic coordinate system, which refers to the North American Datum of 1983. Other relevant information includes the ellipsoid used (GRS 1980, which is a generalized three-dimensional model of the Earth’s shape), the prime meridian of the CRS (Greenwich is used here), and the EPSG (European Petroleum Survey Group) ID of 4269, which is a special code that can be used to represent the CRS in more concise terms.
Thousands of projected CRSs exist in the world. While it can be a challenge to decide on the right projected CRS for your data, the crsuggest package can help narrow down the choices.
For example, for the best mapping of our Washington counties data, we could choose:
library(crsuggest)
wa_crs <- suggest_crs(wa_counties_cb)
The “best choice” is the “NAD83(2011) / Washington North” coordinate reference system, which is available in two variations on the NAD1983 datum. Let’s choose the second entry, which is given in meters rather than U.S. feet. Coordinate reference system transformations in sf are done using the st_transform() function.
wa_projected <- st_transform(wa_counties_cb, crs = 6596)
st_crs(wa_projected)
## Coordinate Reference System:
## User input: EPSG:6596
## wkt:
## PROJCRS["NAD83(2011) / Washington North",
## BASEGEOGCRS["NAD83(2011)",
## DATUM["NAD83 (National Spatial Reference System 2011)",
## ELLIPSOID["GRS 1980",6378137,298.257222101,
## LENGTHUNIT["metre",1]]],
## PRIMEM["Greenwich",0,
## ANGLEUNIT["degree",0.0174532925199433]],
## ID["EPSG",6318]],
## CONVERSION["SPCS83 Washington North zone (meters)",
## METHOD["Lambert Conic Conformal (2SP)",
## ID["EPSG",9802]],
## PARAMETER["Latitude of false origin",47,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8821]],
## PARAMETER["Longitude of false origin",-120.833333333333,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8822]],
## PARAMETER["Latitude of 1st standard parallel",48.7333333333333,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8823]],
## PARAMETER["Latitude of 2nd standard parallel",47.5,
## ANGLEUNIT["degree",0.0174532925199433],
## ID["EPSG",8824]],
## PARAMETER["Easting at false origin",500000,
## LENGTHUNIT["metre",1],
## ID["EPSG",8826]],
## PARAMETER["Northing at false origin",0,
## LENGTHUNIT["metre",1],
## ID["EPSG",8827]]],
## CS[Cartesian,2],
## AXIS["easting (X)",east,
## ORDER[1],
## LENGTHUNIT["metre",1]],
## AXIS["northing (Y)",north,
## ORDER[2],
## LENGTHUNIT["metre",1]],
## USAGE[
## SCOPE["Engineering survey, topographic mapping."],
## AREA["United States (USA) - Washington - counties of Chelan; Clallam; Douglas; Ferry; Grant north of approximately 47°30'N; Island; Jefferson; King; Kitsap; Lincoln; Okanogan; Pend Oreille; San Juan; Skagit; Snohomish; Spokane; Stevens; Whatcom."],
## BBOX[47.08,-124.79,49.05,-117.02]],
## ID["EPSG",6596]]
plot(wa_projected$geometry)
With a properly projected spatial data set, we can now apply any of the geospatial analysis functions (i.e., spatial overlays, intersections, buffers, etc.) without fear of distortion. From a pure mapping perspective, geographic coordinate systems, in \(^\circ\) latitude and longitude, can work, but we suggest almost always using a properly projected coordinate reference system.
While understanding coordinate reference system transformations is important for spatial data management and geographic visualization, other types of geometric manipulations may be required by the spatial analyst.
A common problem for national display of the United States is the fragmented nature of US states and territories geographically. The continental United States can be displayed on a map in a relatively straightforward way, and there are a number of projected coordinate reference systems designed for correct display of the continental US. Often, analysts and cartographers then have to make decisions about how to handle Alaska, Hawaii, and Puerto Rico, which cannot be reasonably plotted using default US projections.
For example, let’s take a U.S. state-level shapefile obtained with tigris at low resolution and use ggplot2 to visualize it in the default geographic CRS, NAD 1983:
us_states <- states(cb = TRUE, resolution = "20m", progress_bar = FALSE)
ggplot(us_states) +
geom_sf() +
theme_minimal()
The plot does not work well, in part because the Aleutian Islands in far west Alaska cross 180\(^\circ\) longitude and are plotted on the opposite side of the map. In response, a projected coordinate reference system appropriate for the United States could be used, such as the continental US Albers Equal Area projection:
ggplot(us_states) +
geom_sf() +
coord_sf(crs = 'ESRI:102003') +
theme_minimal()
While this representation puts all territories in their appropriate locations, it is clearly not appropriate for Alaska, Hawaii, and Puerto Rico which appear distorted.
tigris offers a solution to this problem with the shift_geometry() function. shift_geometry() takes an opinionated approach to the shifting and rescaling of Alaska, Hawaii, and Puerto Rico geometries to offer four options for an alternative view of the US. The function works by projecting geometries in Alaska, Hawaii, and Puerto Rico to appropriate coordinate reference systems for those areas, then re-sizing the geometries (if requested) and moving them to an alternative layout in relationship to the rest of the US using the Albers Equal Area CRS.
us_states_shifted <- shift_geometry(us_states)
ggplot(us_states_shifted) +
geom_sf() +
theme_minimal()
This view uses two default arguments: preserve_area = FALSE, which shrinks Alaska and inflates Hawaii and Puerto Rico, and position = "below", which places these areas below the continental United States. Alternatively, we can set preserve_area = TRUE and position = "outside" for a different view:
us_states_outside <- shift_geometry(us_states,
preserve_area = TRUE,
position = "outside")
ggplot(us_states_outside) +
geom_sf() +
theme_void()
Data from the United States Census Bureau are commonly visualized using maps, as both Census and ACS data are aggregated to enumeration units. Here, we will introduce the use of tidycensus and the geometry = TRUE argument within the get_acs(), get_decennial(), or get_estimates() function to investigate the spatial dimensions of Census and ACS data.
Consider 2020 ACS median household income at the Census tract level in the District of Columbia:
dc_income <- get_acs(
geography = "tract",
variables = "B19013_001",
state = "DC",
year = 2020,
geometry = TRUE,
progress_bar = TRUE
)
head(dc_income)
## Simple feature collection with 6 features and 5 fields
## Geometry type: POLYGON
## Dimension: XY
## Bounding box: xmin: -77.08563 ymin: 38.87646 xmax: -77.009 ymax: 38.94584
## Geodetic CRS: NAD83
## GEOID NAME
## 1 11001001002 Census Tract 10.02, District of Columbia, District of Columbia
## 2 11001002801 Census Tract 28.01, District of Columbia, District of Columbia
## 3 11001003100 Census Tract 31, District of Columbia, District of Columbia
## 4 11001004001 Census Tract 40.01, District of Columbia, District of Columbia
## 5 11001010500 Census Tract 105, District of Columbia, District of Columbia
## 6 11001004600 Census Tract 46, District of Columbia, District of Columbia
## variable estimate moe geometry
## 1 B19013_001 140772 20766 POLYGON ((-77.08563 38.9382...
## 2 B19013_001 62323 22218 POLYGON ((-77.03645 38.9349...
## 3 B19013_001 133408 18433 POLYGON ((-77.02826 38.9318...
## 4 B19013_001 153650 5886 POLYGON ((-77.05018 38.9212...
## 5 B19013_001 92083 20106 POLYGON ((-77.01756 38.8850...
## 6 B19013_001 129896 18312 POLYGON ((-77.01811 38.9146...
Median household income data are found in the estimate column with associated margins of error in the moe column, along with a variable ID, GEOID, and Census tract name. The geometry column contains polygon feature geometry for each Census tract.
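Because ACS estimates are published with 90 percent margins of error, a quick way to gauge the reliability of an estimate is to convert the MOE to a coefficient of variation. The sketch below uses a hypothetical helper, moe_to_cv(), which is not part of tidycensus:

```r
# Hypothetical helper (not part of tidycensus): convert a 90% ACS margin
# of error to a coefficient of variation, expressed in percent. 1.645 is
# the z-score the Census Bureau uses for published 90% MOEs.
moe_to_cv <- function(estimate, moe) {
  100 * (moe / 1.645) / estimate
}

# CV for the first tract shown above; values under roughly 15-30% are
# often treated as reasonably reliable.
moe_to_cv(140772, 20766)
```

A high CV flags tracts whose income estimates should be interpreted cautiously before mapping them.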
We can map the ACS data across space using plot() or the functions in either the ggplot2 or tmap packages.
Because dc_income is a simple features object, both geometry and attributes can be visualized with plot().
plot(dc_income["estimate"])
The plot() function returns a simple map showing income variation in Washington, DC. Wealthier areas, as represented with warmer colors, tend to be located in the northwestern part of the District. NA values are represented on the map in white.
ggplot2 can be used for quick plotting of sf objects using familiar ggplot2 syntax.
For example, let’s consider the 2019 ACS 1-year estimates for median age by state:
us_median_age <- get_acs(
geography = "state",
variables = "B01002_001",
year = 2019,
survey = "acs1",
geometry = TRUE,
resolution = "20m"
) %>%
shift_geometry()
plot(us_median_age$geometry)
The state polygons can be styled using ggplot2 conventions and the geom_sf() function. With two lines of ggplot2 code, a basic map of median age by state can be created with ggplot2 defaults.
ggplot(data = us_median_age, aes(fill = estimate)) +
geom_sf()
Additional editing of the ggplot2 syntax can alter the cartographic design of the map. For instance, we can use the color palettes provided by RColorBrewer via the scale_fill_distiller() function, alter the title, caption, and legend with labs(), and apply theme_minimal().
ggplot(data = us_median_age, aes(fill = estimate)) +
geom_sf() +
scale_fill_distiller(palette = "PuBu") +
labs(title = "Median Age by State, 2019",
caption = "Data source: 2019 1-year ACS, US Census Bureau",
fill = "ACS estimate") +
theme_minimal()
Moving beyond the general-purpose graphics of ggplot2, the tmap package is an excellent alternative for mapping in R, with a wide range of functionality for custom cartography.
Let’s consider total population by Census tract in King County, Washington. First, browse the 2020 Decennial Census (PL 94-171 redistricting file) variables to find the total population variable code:
v20 <- load_variables(2020, dataset = "pl")
king_pop <- get_decennial(
geography = "tract",
state = "WA",
county = "King",
variables = c(Total_Pop = "P1_001N"),
year = 2020,
geometry = TRUE
)
head(king_pop)
## Simple feature collection with 6 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -122.2305 ymin: 47.29001 xmax: -122.1285 ymax: 47.77669
## Geodetic CRS: NAD83
## # A tibble: 6 × 5
## GEOID NAME variable value geometry
## <chr> <chr> <chr> <dbl> <MULTIPOLYGON [°]>
## 1 53033024400 Census Tract 244, King C… Total_P… 3053 (((-122.2299 47.58662, -…
## 2 53033025806 Census Tract 258.06, Kin… Total_P… 5303 (((-122.2179 47.46793, -…
## 3 53033021803 Census Tract 218.03, Kin… Total_P… 6097 (((-122.2232 47.7658, -1…
## 4 53033031908 Census Tract 319.08, Kin… Total_P… 4470 (((-122.1539 47.45231, -…
## 5 53033022703 Census Tract 227.03, Kin… Total_P… 2655 (((-122.1872 47.6696, -1…
## 6 53033030700 Census Tract 307, King C… Total_P… 4317 (((-122.2289 47.2972, -1…
King County, Washington illustrates a complication of Census tract geometries: one tract, covering the waterways surrounding Seattle, has no population. We should remove this tract before proceeding with any mapping. To do so, use the dplyr package and its pipe (%>%) syntax.
missing_pop <- "53033990100"
king_pop <- king_pop %>%
filter(!GEOID %in% missing_pop)
tmap’s map-making syntax will be somewhat familiar to users of ggplot2, as it uses the concept of layers to specify modifications to the map. The map object is initialized with tm_shape(), which then allows us to view the Census tracts with tm_polygons().
library(tmap)
tm_shape(king_pop) +
tm_polygons("value", palette = "PuBu")
By default, tmap uses a classed color scheme rather than the continuous palette used by ggplot2. This involves identifying “classes” in the distribution of data values and mapping each class to a color from a color palette. The default classification scheme used by tm_fill() is “pretty”, which identifies clean-looking intervals based on the data range.
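To build intuition for what “pretty” intervals look like, the sketch below uses base R’s pretty() function on some made-up tract population values; this is an illustration of the interval logic, not tmap’s internal break computation:

```r
# Illustrative only: base R's pretty() produces evenly spaced,
# round-numbered breaks covering the data range, similar in spirit to
# tmap's "pretty" style (tmap computes its own breaks internally).
values <- c(120, 850, 1900, 3053, 4470, 5303, 6097)  # made-up populations
breaks <- pretty(range(values), n = 5)

breaks                                              # round-numbered break points
table(cut(values, breaks, include.lowest = TRUE))   # how many values fall in each class
```

Each value is then assigned the color of the class it falls into, so the map legend shows a small set of readable intervals rather than a continuous gradient.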
Just like with ggplot2, you can alter arguments within tmap functions (e.g., tm_layout()) to improve the map.
tm_shape(king_pop) +
tm_polygons("value", palette = "PuBu", title = "Total Population") +
tm_layout(frame = FALSE,
legend.outside = TRUE)
You can also change the classification scheme using the style argument within the tm_polygons() function. tmap offers a number of classification schemes, including “cat”, “fixed”, “sd”, “equal”, “pretty”, “quantile”, “kmeans”, “hclust”, “bclust”, “fisher”, “jenks”, “dpih”, “headtails”, and “log10_pretty”.
tm_shape(king_pop) +
tm_polygons("value", palette = "PuBu", style = "quantile", title = "Total Population") +
tm_layout(frame = FALSE,
legend.outside = TRUE)
tidycensus provides access to all Census and ACS variables included in the Census APIs. This makes other map types (e.g., dot-density maps) easier to produce using helper functions like as_dot_density(). Let’s create a dot-density map of race and ethnicity in King County, Washington:
king_race <- get_decennial(
geography = "tract",
state = "WA",
county = "King",
variables = c(
Hispanic = "P2_002N",
White = "P2_005N",
Black = "P2_006N",
Native = "P2_007N",
Asian = "P2_008N"),
summary_var = "P2_001N",
year = 2020,
geometry = TRUE
) %>%
mutate(percent = 100 * (value / summary_value))
Next, remove any zero populations/percentages:
king_race <- king_race %>%
filter(value > 0)
Now, convert the data to dots for the dot-density map, with one dot representing 100 people:
king_dots <- king_race %>%
as_dot_density(
value = "value",
values_per_dot = 100,
group = "variable"
)
The map itself is created with the tm_dots() function, which in this example is combined with a background map using tm_polygons() to show the relative racial and ethnic heterogeneity of Census tracts in King County.
background_tracts <- filter(king_race, variable == "White")
tm_shape(background_tracts) +
tm_polygons(col = "white",
border.col = "grey") +
tm_shape(king_dots) +
tm_dots(col = "variable",
palette = "Set1",
size = 0.005,
title = "1 dot = 100 people") +
tm_layout(frame = FALSE,
legend.outside = TRUE)
Areal interpolation is the process of making estimates from a source set of polygons for an overlapping but non-congruent set of target polygons. This technique is helpful when a project requires spatial analysis at the scale of a geometry not included among the Census enumeration units.
Simple areal interpolations can be performed using functions in the areal package. Two function prefixes are used in areal so that users can take advantage of R’s autocomplete functionality:
ar_ - data and functions that are used for multiple interpolation methods
aw_ - functions that are used specifically for areal weighted interpolation
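Before using the areal package, it helps to see what area-weighted interpolation actually computes. The toy sketch below (synthetic squares, not Census data) allocates an extensive variable like population from two source polygons to one overlapping target in proportion to shared area:

```r
# Toy sketch of area-weighted interpolation for an extensive variable.
# Two unit source squares, each with a known population, and one target
# rectangle covering the right half of square 1 and left half of square 2.
library(sf)

src <- st_sf(
  pop = c(100, 200),
  geometry = st_sfc(
    st_polygon(list(rbind(c(0, 0), c(1, 0), c(1, 1), c(0, 1), c(0, 0)))),
    st_polygon(list(rbind(c(1, 0), c(2, 0), c(2, 1), c(1, 1), c(1, 0))))
  )
)
target <- st_sf(geometry = st_sfc(
  st_polygon(list(rbind(c(0.5, 0), c(1.5, 0), c(1.5, 1), c(0.5, 1), c(0.5, 0))))
))

# Intersect source with target; each piece keeps its source attributes.
# In this toy example both source polygons intersect the target, so the
# piece order matches the source order.
pieces  <- st_intersection(src, target)
weights <- as.numeric(st_area(pieces)) / as.numeric(st_area(src))

# Allocate population proportionally: 0.5 * 100 + 0.5 * 200 = 150
sum(pieces$pop * weights)
```

aw_interpolate() performs this same intersect-weight-sum operation for every source/target pair, which is why it needs unique ID columns and a shared projected CRS.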
Let’s consider a problem where we want to express socioeconomic and demographic data in terms of Home Owners’ Loan Corporation (HOLC) neighborhood designations.
Read in the HOLC spatial data and plot it according to neighborhood designation. Shreveport HOLC designations are found here.
ShreveportHOLC <- st_read("Data/Shreveport.json")
## Reading layer `Shreveport' from data source
## `/Users/jlg0003/Dropbox/D4-Census-in-R-Python/Data/Shreveport.json'
## using driver `GeoJSON'
## Simple feature collection with 27 features and 14 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -93.79743 ymin: 32.43106 xmax: -93.70769 ymax: 32.53478
## Geodetic CRS: WGS 84
ShreveportHOLC <- ShreveportHOLC %>%
filter(grade != "NA") # remove industrial and commercial grades
pal = c("darkgreen", "blue", "yellow", "red")
#tmap_mode("view")
tm_shape(ShreveportHOLC) +
tm_polygons(col = "grade", pal = pal, title = "HOLC Grade") +
tm_layout(frame = FALSE,
legend.outside = FALSE, legend.position = c("LEFT", "bottom"))
Now let’s grab Census tracts for Shreveport.
shreveport_pop <- get_decennial(
geography = "tract",
state = "LA",
county = c("Caddo", "Bossier"),
variables = c(Total_Pop = "P1_001N"),
year = 2020,
geometry = TRUE
)
head(shreveport_pop)
## Simple feature collection with 6 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -93.88961 ymin: 32.40204 xmax: -93.63345 ymax: 32.48594
## Geodetic CRS: NAD83
## # A tibble: 6 × 5
## GEOID NAME variable value geometry
## <chr> <chr> <chr> <dbl> <MULTIPOLYGON [°]>
## 1 22017022200 Census Tract 222, Caddo … Total_P… 3781 (((-93.84627 32.46081, -…
## 2 22017023600 Census Tract 236, Caddo … Total_P… 2775 (((-93.84245 32.45697, -…
## 3 22017024201 Census Tract 242.01, Cad… Total_P… 3926 (((-93.88551 32.4134, -9…
## 4 22015010808 Census Tract 108.08, Bos… Total_P… 4532 (((-93.67615 32.47969, -…
## 5 22015010807 Census Tract 108.07, Bos… Total_P… 3732 (((-93.6642 32.4628, -93…
## 6 22017023401 Census Tract 234.01, Cad… Total_P… 3712 (((-93.80738 32.4421, -9…
tm_shape(shreveport_pop) +
tm_polygons("value", palette = "PuBu", style = "quantile", title = "Total Population") +
tm_layout(frame = FALSE,
legend.outside = TRUE)
To find total population within the HOLC neighborhood grades, we need to interpolate the tract data to the spatial dimensions of the HOLC neighborhoods. Start by adding an ID column to both spatial data sets, which aw_interpolate() requires to work properly. In addition, it is important to ensure that both spatial data sets are projected in the same coordinate reference system. We do this below with the st_transform() function.
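A note of caution before the code: st_set_crs() only relabels the CRS metadata, while st_transform() actually recomputes the coordinates. The toy sketch below (a made-up point near Shreveport) shows the difference:

```r
# Toy sketch: st_set_crs() vs st_transform().
library(sf)

pt_raw <- st_sfc(st_point(c(-93.75, 32.5)))  # lon/lat values, no CRS yet

# st_set_crs() attaches a label without touching the numbers -- labeling
# lon/lat coordinates as EPSG:5070 (meters) would be wrong:
st_coordinates(st_set_crs(pt_raw, 5070))     # same numbers, mislabeled

# The correct workflow: label with the true CRS, then reproject.
pt <- st_set_crs(pt_raw, 4326)               # lon/lat (WGS 84)
st_coordinates(st_transform(pt, 5070))       # coordinates recomputed in meters
```

Mixing the two up is an easy way to get an empty intersection (and all-NA results) from aw_interpolate(), since the two layers would no longer overlap in coordinate space.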
library(areal)
ShreveportHOLC$ID <- 1:nrow(ShreveportHOLC)
shreveport_pop$ID2 <- 1:nrow(shreveport_pop)
ShreveportHOLC <- ShreveportHOLC %>%
st_set_crs(4326) %>%
st_transform(., 5070)
shreveport_pop <- shreveport_pop %>%
st_transform(5070) # st_transform() reprojects the NAD83 tract coordinates; st_set_crs() would only relabel them
interpolated_output <- aw_interpolate(ShreveportHOLC,
tid = "ID",
source = shreveport_pop,
sid = "ID2",
extensive = c("value"),
weight = "total",
output = "sf")
interpolated_output <- interpolated_output %>%
mutate(value = round(value))
Now we have a total population count (“value”) for each HOLC neighborhood designation (26 polygons).
We can then map total population by neighborhood individually:
tm_shape(interpolated_output) +
tm_polygons("value", palette = "PuBu", style = "quantile", title = "Total Population") +
tm_layout(frame = FALSE,
legend.outside = FALSE, legend.position = c("LEFT", "bottom"))
Or we can map total population by neighborhood grouped:
interpolated_output_grouped <- interpolated_output %>%
group_by(grade) %>%
summarize(value = sum(value))
tm_shape(interpolated_output_grouped) +
tm_polygons("value", palette = "PuBu", style = "quantile", title = "Total Population") +
tm_layout(frame = FALSE,
legend.outside = FALSE, legend.position = c("LEFT", "bottom"))
Finally, plot the interpolated population side-by-side with the HOLC neighborhood designations.
A <- tm_shape(ShreveportHOLC) +
tm_polygons(col = "grade", pal = pal, title = "HOLC Grade") +
tm_style('white', main.title ="A", main.title.position = "left") +
tm_layout(frame = FALSE,
legend.outside = FALSE)
B <- tm_shape(interpolated_output_grouped) +
tm_polygons("value", palette = "PuBu", style = "quantile", title = "Total Population") +
tm_style('white', main.title ="B", main.title.position = "left") +
tm_layout(frame = FALSE,
legend.outside = FALSE, legend.position = c("LEFT", "bottom"))
tmap_arrange(A, B, ncol = 2, asp = TRUE)
Happy mapping and spatial analysis with Census and ACS estimates in R!